Computational Training: running fastR

Andrea Brizzi

2024-12-12

Great R resources

In general:

  1. Advanced R
  2. R for Data Science
  3. RStudio Cheatsheets
  4. Useful R packages

Covering this session’s topics

  1. data.table
  2. parallel

What is “fast code”?

  1. Easy to write
    familiarity with code editor, libraries

  2. Easy to understand
    structured, with consistent variable names, commented.

  3. Easy to debug
    clear naming, DRY, tests.

  4. Easy to run
    🏎️ (profiling, C++, using “optimized” code).

Variable naming

Names should be consistent, descriptive, lower case, and readable.

For which snippet is it easier to guess the context?

# Snippet 1
tmp <- 10
tmp1 <- tmp * 24

# Snippet 2
cases_per_hour <- 10
cases_per_day <- cases_per_hour * 24

Embrace functional programming I

Functions are first-class citizens in R

  • can be passed as arguments
  • can be returned from other functions
  • can be assigned to variables
  • and more…
first-class-citizenship.R
f <- function(x){x^2}

lapply(1:10, f)

generator <- function(n=2){
    function(x){x^n}
}
cube <- generator(3)

list(one_function = f)
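The three properties listed above can be checked directly. This is a minimal sketch; the names `square`, `power_generator` and `toolbox` are made up for illustration and are not part of the original slides.

```r
square <- function(x) x^2

# 1. Passed as an argument to another function
squares <- sapply(1:3, square)

# 2. Returned from another function: the inner function remembers `n`
power_generator <- function(n = 2) {
  function(x) x^n
}
cube <- power_generator(3)
cube_of_two <- cube(2)                 # 2^3 = 8

# 3. Assigned to a variable or stored in a list, then called from there
toolbox <- list(square = square, cube = cube)
cube_of_three <- toolbox$cube(3)       # 3^3 = 27
```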

Embrace functional programming II

Rethink for and while loops; use the “apply” family instead

“To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs.”

— Bjarne Stroustrup

…but why?

Say you want to extract the \(R^2\) from three linear models with different predictors (or formulae).

formulae.R
formulae <- c(
    Sepal.Length ~ Sepal.Width,
    Sepal.Length ~ Petal.Length,
    Sepal.Length ~ Species
)
bad way
lm_results2 <- c()

for (formula in formulae) {
    fit <- lm(formula, data = iris)
    r2 <- summary(fit)$r.squared
    lm_results2 <- c(lm_results2, r2)
}
good way
extract_r2 <- function(formula) {
    fit <- lm(formula, data = iris)
    r2 <- summary(fit)$r.squared
    return(r2)
}

lm_results <- sapply(formulae, extract_r2)
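What’s the difference?
side effects: the for loop leaves fit, r2 and formula behind in the calling environment, whereas extract_r2 keeps its intermediate objects local.
exists("fit")  # TRUE once the loop has run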
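One way to see the side effects is to run both versions in their own scratch environments and list what each leaves behind. This is only a sketch: `loop_env` and `apply_env` are hypothetical names, and `vapply` stands in for `sapply` so that the return type is explicit.

```r
formulae <- list(
  Sepal.Length ~ Sepal.Width,
  Sepal.Length ~ Petal.Length
)

# Loop version: every intermediate object stays in the environment
loop_env <- new.env()
evalq({
  r2_loop <- c()
  for (formula in formulae) {
    fit <- lm(formula, data = iris)
    r2_loop <- c(r2_loop, summary(fit)$r.squared)
  }
}, loop_env)
ls(loop_env)    # "fit" "formula" "r2_loop"

# Function version: only the function and the result remain
apply_env <- new.env()
evalq({
  extract_r2 <- function(formula) {
    fit <- lm(formula, data = iris)
    summary(fit)$r.squared
  }
  r2_apply <- vapply(formulae, extract_r2, numeric(1))
}, apply_env)
ls(apply_env)   # "extract_r2" "r2_apply"
```

Both versions compute the same values; they differ only in what they leave lying around afterwards.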

The parallel package

You can imagine wanting to run each of the apply/for loop iterations in parallel.

forking (not on Windows)
library(parallel)
f <- function(i) {
    lme4::lmer(
        Petal.Width ~ . - Species + (1 | Species),
        data = iris)
}

system.time(save1 <- lapply(1:100, f))
##    user  system elapsed
##   2.048   0.019   2.084
system.time(save2 <- mclapply(1:100, f))
##    user  system elapsed
##   1.295   0.150   1.471
sockets
num_cores <- detectCores()
cl <- makeCluster(num_cores)
system.time(save3 <- parLapply(cl, 1:100, f))
##    user  system elapsed
##   0.198   0.044   1.032
stopCluster(cl)
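  • requires further attention: each worker starts as a fresh R session, so the packages and variables that f relies on must be loaded or exported explicitly (e.g. with clusterEvalQ() or clusterExport()).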
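As an illustration of that extra attention, the sketch below exports a variable to the workers before using it. The names `offset` and `add_offset` and the choice of two workers are assumptions made for the example, not part of the original slides.

```r
library(parallel)

offset <- 10                               # lives in the main session only
add_offset <- function(i) i + offset       # relies on `offset`

cl <- makeCluster(2)                       # two workers, an arbitrary choice

# Copy `offset` from the current environment to every worker
clusterExport(cl, "offset", envir = environment())

res <- parLapply(cl, 1:4, add_offset)
stopCluster(cl)

unlist(res)                                # 11 12 13 14
```

Packages can be loaded on the workers in a similar way with `clusterEvalQ(cl, library(...))`, or avoided by using `pkg::fun` calls as in f above.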

Introduction to data.table

  • data.table is a package that extends the data.frame class.
  • (Quicker) alternative to dplyr for large datasets.
  • cheatsheet

all you need to know

dt[i, j, by]

  • use the data.table called dt
  • subset it on the rows specified by i
  • and manipulate columns with j
  • grouped according to by.
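A minimal sketch that uses all three arguments at once on iris; the threshold of 5 and the column names `mean_petal_length` and `n_flowers` are arbitrary choices for illustration.

```r
library(data.table)

dt <- as.data.table(iris)

# i:  keep the rows with Sepal.Length > 5
# j:  compute the mean petal length and the number of rows (.N)
# by: do this separately for each species
summary_dt <- dt[
  Sepal.Length > 5,
  .(mean_petal_length = mean(Petal.Length), n_flowers = .N),
  by = Species
]
summary_dt
```

The result is itself a data.table with one row per species, so further `[i, j, by]` calls can be chained onto it.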

Basic Examples with iris

Example 1: Subsetting and Summarizing

  • Use i to keep only the setosa rows, then j to compute their mean sepal length.
library(data.table)
dt <- as.data.table(iris)
dt[Species == "setosa", mean(Sepal.Length)]

Example 2: Grouping and Aggregating

  • Leave i empty and use by to compute the mean sepal length for each species.
library(data.table)
dt <- as.data.table(iris)
dt[, .(mean_sepal_length = mean(Sepal.Length)), by = Species]

Profiling

  • Identify (and hopefully fix!) bottlenecks in your code.
  • The profvis package is a good tool for this.

Profiling Example: Column Means

library(profvis)
library(data.table)

# Simulate a wide data set: 4e5 rows, 150 numeric columns, plus an id column
n <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(n * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(n)), data)
dataDF <- as.data.table(data)  # data.table copy of the same data
numeric_vars <- setdiff(names(data), "id")

# Six ways of computing the column means, profiled line by line
profvis({
  means <- apply(data[, names(data) != "id"], 2, mean)
  means <- colMeans(data[, names(data) != "id"])
  means <- lapply(data[, names(data) != "id"], mean)
  means <- vapply(data[, names(data) != "id"], mean, numeric(1))
  means <- matrixStats::colMeans2(as.matrix(data[, names(data) != "id"]))
  means <- dataDF[, lapply(.SD, mean), .SDcols = numeric_vars]
})
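For a quick check without profvis, two of the approaches can be compared with base R's system.time on a much smaller data set. This is a rough sketch: `n_small`, `cols_small` and `small` are hypothetical names, and system.time is far coarser than a profiler.

```r
set.seed(1)
n_small <- 1e4       # far fewer rows than in the profiled example
cols_small <- 20
small <- as.data.frame(
  matrix(rnorm(n_small * cols_small, mean = 5), ncol = cols_small)
)

time_apply <- system.time(means_apply <- apply(small, 2, mean))
time_colmeans <- system.time(means_colmeans <- colMeans(small))

# Both approaches should agree up to floating point error
all.equal(means_apply, means_colmeans)

# Elapsed seconds for each approach (values depend on the machine)
c(apply = time_apply[["elapsed"]], colMeans = time_colmeans[["elapsed"]])
```

Timings on data this small are noisy, so treat them as a sanity check rather than a benchmark.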

Profiling